About Data Analysis Report

This RMarkdown file contains the report of the data analysis carried out for the project on forecasting daily bike rental demand using time series models in R. It covers data exploration, summary statistics and the building of the time series models. The final report was completed on Fri Aug 15 12:37:57 2025.

Data Description:

This dataset contains the daily count of rental bike transactions between the years 2011 and 2012 in the Capital Bikeshare system, with the corresponding weather and seasonal information.

Data Source: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset

Relevant Paper:

Fanaee-T, Hadi, and Gama, Joao, ‘Event labeling combining ensemble detectors and background knowledge’, Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg

Objectives

The objective is to build a short-term forecasting model for daily rentals using an ARIMA model. Before ARIMA is implemented, the data is evaluated for stationarity (a prerequisite for ARIMA) and for seasonal patterns.

Task One: Load and explore the data

Load data and install packages
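The chunks below call functions from several packages but no library() calls appear in the report. A minimal setup chunk, assuming the packages listed are already installed (install.packages() them first if not), might look like:

```r
library(tidyverse)    # dplyr verbs, ggplot2, tidyr::drop_na()
library(lubridate)    # as_date()
library(timetk)       # plot_time_series(), tk_stl_diagnostics()
library(modeltime)    # arima_reg(), modeltime_table(), modeltime_forecast()
library(parsnip)      # set_engine(), fit()
library(tseries)      # adf.test()
library(ggpubr)       # theme_pubr()
```

DataExplorer is called below with the `DataExplorer::` prefix, so it only needs to be installed, not attached.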

Describe and explore the data

Step 1 - Import and clean the data

# Import the data
data <- read.csv("day.csv") # data is a data frame
# check structure of dataframe
str(data) # 731 rows and 16 variables
## 'data.frame':    731 obs. of  16 variables:
##  $ instant   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ dteday    : chr  "2011-01-01" "2011-01-02" "2011-01-03" "2011-01-04" ...
##  $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ yr        : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ mnth      : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weekday   : int  6 0 1 2 3 4 5 6 0 1 ...
##  $ workingday: int  0 0 1 1 1 1 1 0 0 1 ...
##  $ weathersit: int  2 2 1 1 1 1 2 2 1 1 ...
##  $ temp      : num  0.344 0.363 0.196 0.2 0.227 ...
##  $ atemp     : num  0.364 0.354 0.189 0.212 0.229 ...
##  $ hum       : num  0.806 0.696 0.437 0.59 0.437 ...
##  $ windspeed : num  0.16 0.249 0.248 0.16 0.187 ...
##  $ casual    : int  331 131 120 108 82 88 148 68 54 41 ...
##  $ registered: int  654 670 1229 1454 1518 1518 1362 891 768 1280 ...
##  $ cnt       : int  985 801 1349 1562 1600 1606 1510 959 822 1321 ...
# convert dteday from character to date format
data$dteday <- as_date(data$dteday) 

# Replace numeric weekdays with labeled factor
data <- data %>%
  mutate(weekday = factor(weekday,
                          levels = 0:6,
                          labels = c("Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday")))
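For a daily series it is also worth confirming that no dates are missing before any time series modelling. A quick sanity check (not part of the original analysis) could be:

```r
# every consecutive pair of dates should be exactly 1 day apart
stopifnot(all(diff(data$dteday) == 1))

# the row count should match the span of the calendar (731 daily observations)
n_days <- as.integer(max(data$dteday) - min(data$dteday)) + 1
stopifnot(n_days == nrow(data))
```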

Step 2 - Explore the data

glimpse(data)
## Rows: 731
## Columns: 16
## $ instant    <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
## $ dteday     <date> 2011-01-01, 2011-01-02, 2011-01-03, 2011-01-04, 2011-01-05…
## $ season     <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ yr         <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ mnth       <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ holiday    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,…
## $ weekday    <fct> Saturday, Sunday, Monday, Tuesday, Wednesday, Thursday, Fri…
## $ workingday <int> 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1,…
## $ weathersit <int> 2, 2, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2, 2, 2,…
## $ temp       <dbl> 0.3441670, 0.3634780, 0.1963640, 0.2000000, 0.2269570, 0.20…
## $ atemp      <dbl> 0.3636250, 0.3537390, 0.1894050, 0.2121220, 0.2292700, 0.23…
## $ hum        <dbl> 0.805833, 0.696087, 0.437273, 0.590435, 0.436957, 0.518261,…
## $ windspeed  <dbl> 0.1604460, 0.2485390, 0.2483090, 0.1602960, 0.1869000, 0.08…
## $ casual     <int> 331, 131, 120, 108, 82, 88, 148, 68, 54, 41, 43, 25, 38, 54…
## $ registered <int> 654, 670, 1229, 1454, 1518, 1518, 1362, 891, 768, 1280, 122…
## $ cnt        <int> 985, 801, 1349, 1562, 1600, 1606, 1510, 959, 822, 1321, 126…
# summary of the data to check for any weird values
summary(data)
##     instant          dteday               season            yr        
##  Min.   :  1.0   Min.   :2011-01-01   Min.   :1.000   Min.   :0.0000  
##  1st Qu.:183.5   1st Qu.:2011-07-02   1st Qu.:2.000   1st Qu.:0.0000  
##  Median :366.0   Median :2012-01-01   Median :3.000   Median :1.0000  
##  Mean   :366.0   Mean   :2012-01-01   Mean   :2.497   Mean   :0.5007  
##  3rd Qu.:548.5   3rd Qu.:2012-07-01   3rd Qu.:3.000   3rd Qu.:1.0000  
##  Max.   :731.0   Max.   :2012-12-31   Max.   :4.000   Max.   :1.0000  
##                                                                       
##       mnth          holiday             weekday      workingday   
##  Min.   : 1.00   Min.   :0.00000   Sunday   :105   Min.   :0.000  
##  1st Qu.: 4.00   1st Qu.:0.00000   Monday   :105   1st Qu.:0.000  
##  Median : 7.00   Median :0.00000   Tuesday  :104   Median :1.000  
##  Mean   : 6.52   Mean   :0.02873   Wednesday:104   Mean   :0.684  
##  3rd Qu.:10.00   3rd Qu.:0.00000   Thursday :104   3rd Qu.:1.000  
##  Max.   :12.00   Max.   :1.00000   Friday   :104   Max.   :1.000  
##                                    Saturday :105                  
##    weathersit         temp             atemp              hum        
##  Min.   :1.000   Min.   :0.05913   Min.   :0.07907   Min.   :0.0000  
##  1st Qu.:1.000   1st Qu.:0.33708   1st Qu.:0.33784   1st Qu.:0.5200  
##  Median :1.000   Median :0.49833   Median :0.48673   Median :0.6267  
##  Mean   :1.395   Mean   :0.49538   Mean   :0.47435   Mean   :0.6279  
##  3rd Qu.:2.000   3rd Qu.:0.65542   3rd Qu.:0.60860   3rd Qu.:0.7302  
##  Max.   :3.000   Max.   :0.86167   Max.   :0.84090   Max.   :0.9725  
##                                                                      
##    windspeed           casual         registered        cnt      
##  Min.   :0.02239   Min.   :   2.0   Min.   :  20   Min.   :  22  
##  1st Qu.:0.13495   1st Qu.: 315.5   1st Qu.:2497   1st Qu.:3152  
##  Median :0.18097   Median : 713.0   Median :3662   Median :4548  
##  Mean   :0.19049   Mean   : 848.2   Mean   :3656   Mean   :4504  
##  3rd Qu.:0.23321   3rd Qu.:1096.0   3rd Qu.:4776   3rd Qu.:5956  
##  Max.   :0.50746   Max.   :3410.0   Max.   :6946   Max.   :8714  
## 
## We can see that hum (relative humidity) has a minimum of 0. A relative humidity of 0% is a physical impossibility on Earth, and certainly not conceivable in the humid subtropical climate of DC, so it is likely an artefact of the data collection.
## summary() treats every variable as continuous, so binary variables such as workingday get a mean between 0 and 1 (here 0.684), a value with no direct meaning.
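One way to address this (an optional step, deliberately not applied to `data` below) is to convert the categorical columns to factors before calling summary(), so they are summarised as level counts rather than meaningless means:

```r
data %>%
  mutate(across(c(season, yr, holiday, workingday, weathersit), factor)) %>%
  summary()
# factors are now reported as frequencies per level, e.g. workingday 0/1 counts
```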


# generate deeper EDA report
DataExplorer::create_report(data)
## Output created: report.html
# the report shows no missing values
# month, season, weathersit and weekday are discrete variables - shown in the univariate distribution
## The QQ plot shows how normally the data are distributed. It is intended for continuous variables; discrete or categorical variables such as weekday or season are not appropriate for this plot type.

# What effect does temperature have on the average number of bikes being hired?
data %>%
  select(dteday, season, temp, weekday, cnt) %>%
  group_by(season, weekday) %>%
  summarise(avg_cnt = mean(cnt), mean_t_c = mean(temp * 41)) # temp is normalised by a max of 41°C; multiply back to recover °C
## `summarise()` has grouped output by 'season'. You can override using the
## `.groups` argument.
## # A tibble: 28 × 4
## # Groups:   season [4]
##    season weekday   avg_cnt mean_t_c
##     <int> <fct>       <dbl>    <dbl>
##  1      1 Sunday      2229.     11.5
##  2      1 Monday      2453.     11.3
##  3      1 Tuesday     2793.     12.2
##  4      1 Wednesday   2611.     12.4
##  5      1 Thursday    2894.     13.1
##  6      1 Friday      2856.     12.9
##  7      1 Saturday    2432.     12.1
##  8      2 Sunday      4987.     22.4
##  9      2 Monday      4565      22.9
## 10      2 Tuesday     4825.     22.9
## # ℹ 18 more rows
# --> The hotter the temperature, the more bikes are being rented out

ggplot(data, aes(x = dteday, y = temp * 41)) + geom_point() + labs(x = "Date", y = "Temperature (°C)") # temp is normalised, so multiply by 41 to plot °C

# the temperature fluctuates from approximately 6°C to 33°C over the year. The lowest temperatures at the end of 2012 appear higher than at the end of 2011, but no significance test has been run on this.


boxplot_temp_by_year <- data %>%
  mutate(
    temp_c = temp * 41
  ) %>%
  ggplot(aes(x = factor(yr), y = temp_c)) +
  geom_boxplot() +
  scale_x_discrete(labels = c("0" = "2011", "1" = "2012")) +
  labs(
    x = "Year",
    y = "Temperature (°C)"
  ) + theme_pubr()

boxplot_temp_by_year

# which seasons and months are peak and off-peak?

seasonal_peaks <- data %>%
  group_by(season) %>%
  summarise(total_rentals = sum(cnt), .groups = "drop") %>%
  arrange(desc(total_rentals))
seasonal_peaks
## # A tibble: 4 × 2
##   season total_rentals
##    <int>         <int>
## 1      3       1061129
## 2      2        918589
## 3      4        841613
## 4      1        471348
monthly_peaks <- data %>%
  group_by(mnth) %>%
  summarise(total_rentals = sum(cnt), .groups = "drop") %>%
  arrange(desc(total_rentals))
monthly_peaks
## # A tibble: 12 × 2
##     mnth total_rentals
##    <int>         <int>
##  1     8        351194
##  2     6        346342
##  3     9        345991
##  4     7        344948
##  5     5        331686
##  6    10        322352
##  7     4        269094
##  8    11        254831
##  9     3        228920
## 10    12        211036
## 11     2        151352
## 12     1        134933

Temperature range

In 2011, the temperature ranges from 2.4°C to 35.3°C with a median of 20.4°C. No formal significance test has been run, but temperatures in 2012 appear slightly higher, especially when comparing the medians, and the range of values in 2012 is narrower.

Peak season

The peak months are August, June and September. Rentals fall away through April, November, March and December, with February and January the quietest months. In terms of yearly seasons, Fall is the peak season, followed by Summer, Winter and then Spring.

Task Two: Create interactive time series plots

In this task, three time series are plotted:

  1. the number of bikes rented over time
  2. the number of casual bike renters over time
  3. the number of casual bike renters by week day.
## Read about the timetk package
# ?timetk

# how many bikes are rented over time
riders_ts <- data %>%
  plot_time_series(dteday, cnt, .y_lab = "Number of Bikes") # inside a pipe, use the bare column name, not data$dteday
riders_ts
# how many casual riders are there over time
casual_rider_ts <- data %>%
  plot_time_series(dteday, casual, .y_lab = "Number of Casual Riders")
casual_rider_ts
# how many casual riders are there over time by day of the week
casual_riders_by_weekday <- data %>%
  group_by(weekday) %>%
  plot_time_series(dteday, casual, .facet_ncol=2, .y_lab = "Number of Casual Riders", .title = "Number of Casual Riders vs Time by Week Day")
casual_riders_by_weekday
registered_riders_by_weekday <- data %>%
  group_by(weekday) %>%
  plot_time_series(dteday, registered, .facet_ncol=2, .y_lab = "Number of Registered Riders", .title = "Number of Registered Riders vs Time by Week Day")
registered_riders_by_weekday

Interpretation

There are consistently fewer casual riders taking trips on work days than on Saturdays and Sundays, suggesting casual users ride for leisure rather than commuting. There is a mid-year peak and a drop at the end of the year, consistent with seasonal leisure behaviour.

There are consistently more registered riders than casual riders. The smoothed trend line fluctuates less across weekdays than for casual riders, especially Monday to Friday. This indicates that registered users are primarily commuters, with regular weekday usage and reduced weekend activity.

Task Three: Smooth time series data

plot_time_series(
  .data       = data,
  .date_var   = dteday,
  .value      = cnt,
  .smooth     = TRUE,        # enables smoothing
  .interactive = FALSE,      # set TRUE for plotly
  .title      = "Smoothed Daily Bike Rentals Over Time",
  .y_lab      = "Number of Riders",
  .x_lab      = "Date"
)

Interpretation

The black line is for actual daily rental counts and it is spiky due to the daily variability. The blue, smoothed Loess line shows the underlying long-term trend. There is rising adoption in Jan-June 2011 which may reflect a recent launch of the bike hire system or be due to weather changes. There is a plateau between July 2011 and December 2011 where growth stabilises. The market may be saturated by early adopters, or the uptake is affected by the end of summer or other external influences like limits to infrastructure. There is another growth phase between Jan and Sept 2012 which may be due to increased membership, more bike infrastructure e.g. stations, better weather and changes in commuting habits. Then there is a decline in September-December 2012 which may be due to colder weather, public holidays or other external disruptions. The plot shows macro trends (growth, plateau, decline) but not micro patterns (weekly seasonality and effect of public holidays). Below, the effect of weather is incorporated into the modelling.
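The amount of smoothing can be tuned. Assuming timetk's `.smooth_span` argument (the fraction of the data used by the Loess smoother), a sketch with a wider span that emphasises the macro trend even more strongly would be:

```r
plot_time_series(
  .data        = data,
  .date_var    = dteday,
  .value       = cnt,
  .smooth      = TRUE,
  .smooth_span = 0.3,   # larger span = smoother line, less local detail
  .interactive = FALSE,
  .title       = "Daily Bike Rentals with a Wider Loess Span"
)
```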

Task Four: Decompose and assess the stationarity of time series data

The goal of this task is to break the series into trend, seasonality, and noise components, then assess whether the series is stationary. Stationarity means the mean, variance, and autocorrelation structure do not change over time — a key assumption for ARIMA modelling. Non-stationary data can lead to misleading correlations and poor forecasts.
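To build intuition for what the stationarity check detects, a small simulated example (not project data) compares a stationary white-noise series with a non-stationary random walk:

```r
library(tseries)
set.seed(42)
white_noise <- rnorm(500)          # stationary: constant mean and variance
random_walk <- cumsum(rnorm(500))  # non-stationary: variance grows over time

adf.test(white_noise)$p.value  # typically small: reject non-stationarity
adf.test(random_walk)$p.value  # typically large: fail to reject
```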

Step 1 - STL Decomposition We use seasonal-trend decomposition via Loess (STL) to identify the trend (long-term direction of data), seasonality (repeating patterns at fixed intervals e.g. weekly) and remainder (residual noise after removing trend and seasonality).

p_stl <- plot_stl_diagnostics(
  .data      = data,
  .date_var  = dteday,
  .value     = cnt,
  .frequency = "7 days",
  .trend     = "3 months",
  .title = "STL Decomposition of Daily Bike Rentals",
  .y_lab = "Count of Bike Rentals"
)
## frequency = 7 observations per 7 days
## trend = 90 observations per 3 months
decomp_tbl <- tk_stl_diagnostics(
  .data      = data,
  .date_var  = dteday,
  .value     = cnt,
  .frequency = "7 days",
  .trend     = "3 months"

)
## frequency = 7 observations per 7 days
## trend = 90 observations per 3 months

Step 2 - ACF/PACF of Remainder

p_acf_remainder <- plot_acf_diagnostics(
  .data     = decomp_tbl,
  .date_var = dteday,
  .value    = remainder,
  .lags     = 60,
  .title = "ACF & PACF of STL Remainder (Deseasonalised Residuals)"
)
p_acf_remainder

Interpretation - ACF shows a large spike at lag 1 (~0.9) with slow exponential decay over lags 2–10, suggesting strong autocorrelation and non-stationarity.

Step 3 - Stationarity Test (ADF) on Raw Series

Using Augmented Dickey-Fuller (ADF) test to statistically confirm stationarity.

# Test for stationarity on the dataset
adf.test(data$cnt)
## 
##  Augmented Dickey-Fuller Test
## 
## data:  data$cnt
## Dickey-Fuller = -1.6351, Lag order = 9, p-value = 0.7327
## alternative hypothesis: stationary
# p-value > 0.05 --> fail to reject the null hypothesis --> the series is likely non-stationary

Conclusion for Task 4:

  • The series shows strong autocorrelation and fails the ADF test for stationarity.
  • Differencing (first and seasonal) is needed before fitting ARIMA.
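The number of differences needed can also be estimated automatically. Assuming the forecast package is available, `ndiffs()` and `nsdiffs()` suggest the non-seasonal and seasonal differencing orders:

```r
library(forecast)
ndiffs(data$cnt)                      # suggested number of first differences
nsdiffs(ts(data$cnt, frequency = 7))  # suggested number of seasonal differences
```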

Task Five: Fit and forecast time series data using ARIMA models

Given the series is non-stationary, we apply differencing to achieve stationarity before fitting an ARIMA model. ARIMA has both seasonal and non-seasonal components:

General form: ARIMA(p, d, q)(P, D, Q)[s]

Component meanings:

  • p - non-seasonal autoregressive (AR) order. p = 1 uses 1 lag of past values (AR(1)), i.e. “yesterday’s value”.
  • d - non-seasonal differencing order, to remove trend. d = 1 applies a first difference to make the series stationary.
  • q - non-seasonal moving average (MA) order. q = 1 uses 1 lag of the error terms (MA(1)).

Together, p, d and q capture the short-term autocorrelation and trend.

  • P - seasonal AR order. P = 1 uses 1 seasonal lag of past values, i.e. the value 7 days ago.
  • D - seasonal differencing order. D = 0 means no seasonal differencing - the data were already deseasonalised or did not require it.
  • Q - seasonal MA order. Q = 2 uses 2 seasonal lags of the error terms (i.e. errors from 7 and 14 days ago).
  • [s] - season length (the period of the seasonality). s = 7 gives weekly seasonality, capturing the repeating weekly patterns in the data.
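The same specification can also be fitted directly with the forecast package, which makes the (p, d, q)(P, D, Q)[s] notation explicit. A sketch using the spec that auto.arima selects later in this report:

```r
library(forecast)
cnt_ts <- ts(data$cnt, frequency = 7)   # weekly seasonality, s = 7
fit <- Arima(cnt_ts, order = c(1, 1, 1), seasonal = c(1, 0, 2))
summary(fit)            # coefficients, sigma^2, AIC
forecast(fit, h = 30)   # 30-day-ahead forecast with prediction intervals
```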

Step 1 - Differencing and Stationarity Check

# run test again after differencing
cnt_diff1 <- diff(data$cnt, lag = 1)

adf.test(na.omit(cnt_diff1))
## Warning in adf.test(na.omit(cnt_diff1)): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  na.omit(cnt_diff1)
## Dickey-Fuller = -13.798, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary
# p-value < 0.05 --> reject the null hypothesis --> the differenced series is stationary --> can use ARIMA

# run after seasonal and first differencing (weekly seasonality = 7 day lag)
cnt_diff1_7 <- diff(cnt_diff1, lag = 7) 

adf.test(na.omit(cnt_diff1_7))
## Warning in adf.test(na.omit(cnt_diff1_7)): p-value smaller than printed p-value
## 
##  Augmented Dickey-Fuller Test
## 
## data:  na.omit(cnt_diff1_7)
## Dickey-Fuller = -16.339, Lag order = 8, p-value = 0.01
## alternative hypothesis: stationary
# p-value < 0.05 --> the seasonally differenced series is also likely stationary


# plot raw and differenced
# Raw series
acf_raw <- plot_acf_diagnostics(
  .data     = data,
  .date_var = dteday,
  .value    = cnt,
  .lags     = 60,
  .title    = "Raw Series: ACF & PACF"
)

acf_raw
# First + seasonal differencing
data_diff <- data %>%
  mutate(
    cnt_diff1   = c(NA, diff(cnt, lag = 1)),               # first difference
    cnt_diff1_7 = c(rep(NA, 7), diff(cnt_diff1, lag = 7))  # seasonal (lag-7) difference of the first difference
  ) %>%
  drop_na(cnt_diff1_7)
head(data_diff)
##   instant     dteday season yr mnth holiday   weekday workingday weathersit
## 1       9 2011-01-09      1  0    1       0    Sunday          0          1
## 2      10 2011-01-10      1  0    1       0    Monday          1          1
## 3      11 2011-01-11      1  0    1       0   Tuesday          1          2
## 4      12 2011-01-12      1  0    1       0 Wednesday          1          1
## 5      13 2011-01-13      1  0    1       0  Thursday          1          1
## 6      14 2011-01-14      1  0    1       0    Friday          1          1
##       temp    atemp      hum windspeed casual registered  cnt cnt_diff1
## 1 0.138333 0.116175 0.434167  0.361950     54        768  822      -137
## 2 0.150833 0.150888 0.482917  0.223267     41       1280 1321       499
## 3 0.169091 0.191464 0.686364  0.122132     43       1220 1263       -58
## 4 0.172727 0.160473 0.599545  0.304627     25       1137 1162      -101
## 5 0.165000 0.150883 0.470417  0.301000     38       1368 1406       244
## 6 0.160870 0.188413 0.537826  0.126548     54       1367 1421        15
##   cnt_diff1_7
## 1          47
## 2         -49
## 3        -271
## 4        -139
## 5         238
## 6         111
acf_diff <- plot_acf_diagnostics(
  .data     = data_diff,
  .date_var = dteday,
  .value    = cnt_diff1_7,
  .lags     = 60,
  .title    = "Differenced Series: ACF & PACF"
)

acf_diff

Interpretation

Before differencing:
  • ACF: large spike at lag 1 with slow decay --> strong autocorrelation
  • PACF: large spike at lag 1, others near 0 --> AR(1) behaviour --> non-stationarity

After differencing:
  • ACF: immediate drop after lag 0, no prolonged decay --> reduced autocorrelation
  • PACF: no significant spikes after lag 0 --> stationarity confirmed --> ARIMA can be used

Step 2 - Fit ARIMA model

# Modeltime-friendly version
model <- arima_reg() %>%
  set_engine("auto_arima") %>%
  fit(cnt ~ dteday, data = data)
## frequency = 7 observations per 1 week
model_tbl <- modeltime_table(model)

model_calibrated <- model_tbl %>%
  modeltime_calibrate(new_data = data)

Step 3 - Forecast with Confidence Intervals

model_calibrated %>%
  modeltime_forecast(h = "30 days", actual_data = data) %>%
  plot_modeltime_forecast(
    .conf_interval_show  = TRUE,  # interval display is a plotting option, not a forecast option
    .conf_interval_alpha = 0.2,
    .title = "30-Day Forecast with 95% Confidence Intervals"
  )

Interpretation:

  • The 95% confidence interval (CI) indicates the range within which future daily rental counts are expected to fall 95% of the time.

  • CI is narrow in the short term (first week), moderately wide in weeks 2–3, and widens substantially after that — reflecting increasing uncertainty over time.

  • Historical data shows large swings (e.g., April and November 2012), so some volatility in the forecast is expected.

  • CI is symmetrical, suggesting no systematic bias.

Step 4 - Comparison of ARIMA and ARIMAX Models

library(dplyr)
library(timetk)
library(modeltime)
library(parsnip)

# 0) Clean & features ----------------------------------------------------------
data <- data %>%
  mutate(
    dteday = as.Date(dteday),
    temp_c = temp * 41
  ) %>%
  arrange(dteday)

# Baseline ARIMA (no exogenous regressors; dteday is only the date variable modeltime requires)
arima_model <- arima_reg() %>%
  set_engine("auto_arima") %>%
  fit(cnt ~ dteday, data = data)
## frequency = 7 observations per 1 week
# ARIMAX with xregs
arimax_model <- arima_reg() %>%
  set_engine("auto_arima") %>%
  fit(cnt ~ dteday + temp_c + hum + windspeed, data = data)
## frequency = 7 observations per 1 week
model_tbl <- modeltime_table(arima_model, arimax_model)

refit_tbl <- model_tbl %>%
  modeltime_refit(data = data)
## frequency = 7 observations per 1 week
## frequency = 7 observations per 1 week
refit_tbl %>%
  modeltime_forecast(
    new_data = data,
    actual_data = data
  ) %>%
  plot_modeltime_forecast(
    .title = "ARIMA vs ARIMAX In-Sample Fit",
    .conf_interval_alpha = 0.2
  )
## Warning: ✖ Expecting the following names to be in the data frame: .conf_hi, .conf_lo.
## ℹ Proceeding with '.conf_interval_show = FALSE' to visualize the forecast without confidence intervals.
## Alternatively, try using `modeltime_calibrate()` before forecasting to add confidence intervals.
# 5) Accuracy on the calibration window ---------------------------------------
refit_tbl %>%
  modeltime_calibrate(new_data = data) %>%
  modeltime_accuracy()
## # A tibble: 2 × 9
##   .model_id .model_desc                .type   mae  mape  mase smape  rmse   rsq
##       <int> <chr>                      <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1         1 ARIMA(1,1,1)(1,0,2)[7]     Fitt…  635.  57.9 0.870  17.2  915. 0.778
## 2         2 REGRESSION WITH ARIMA(2,1… Fitt…  542.  46.7 0.743  15.3  770. 0.842
# Inspect selected specs if curious
arima_model$fit
## Series: outcome 
## ARIMA(1,1,1)(1,0,2)[7] 
## 
## Coefficients:
##          ar1      ma1    sar1     sma1    sma2
##       0.3612  -0.9005  0.9106  -0.9035  0.0545
## s.e.  0.0427   0.0205  0.0760   0.0865  0.0417
## 
## sigma^2 = 844004:  log likelihood = -6014.88
## AIC=12041.75   AICc=12041.87   BIC=12069.31
arimax_model$fit
## Series: outcome 
## Regression with ARIMA(2,1,1) errors 
## 
## Coefficients:
##          ar1     ar2      ma1    temp_c         hum   windspeed
##       0.3570  0.0588  -0.8965  133.6327  -3340.5802  -3270.4342
## s.e.  0.0471  0.0431   0.0271   11.9336    237.0409    387.6349
## 
## sigma^2 = 598871:  log likelihood = -5888.77
## AIC=11791.54   AICc=11791.7   BIC=11823.7

Interpretation

ARIMA and ARIMAX models are compared in the figure. ARIMAX - the green line - tracks the shape of the actual time series (black) closely, which means the weather variables (temperature, humidity and wind speed) explain some of the variance. ARIMA underperforms because, as a univariate model on data with strong seasonal influence, it mostly captures the long-term trend.

The residual variance of ARIMAX is lower than that of ARIMA, as is its AIC, so it is the better model. The ARIMAX coefficients show that warmer weather means more rentals, while higher humidity or higher wind speed mean fewer rentals. The absolute values of the coefficients are large with relatively small standard errors, suggesting the coefficients are statistically significant. The large magnitudes suggest a strong effect, though they may also signal multicollinearity among the weather variables.

Task Six: Findings and Conclusions

1. Data Characteristics

  • The dataset contains 731 daily observations from the Capital Bikeshare system covering January 2011 to December 2012.

  • Features include date, weather, seasonal factors, and total bike rental counts (cnt), with casual and registered user breakdowns.

2. Trend and Seasonality

  • The smoothed time series plot of daily rentals (cnt) shows a clear upward trend over time.

  • STL decomposition revealed a weekly seasonal pattern, with higher rentals on specific days of the week.

  • However, the seasonal component is less dominant than the trend and random fluctuation components, especially in the first year.

3. Stationarity and Autocorrelation

  • ACF of the original series showed a strong lag-1 autocorrelation (~0.9) and a slow, exponential decay across further lags, indicating non-stationarity.

  • PACF showed a significant spike at lag 1 with all other lags close to zero — typical for an AR(1) process.

  • After differencing (first and seasonal), ACF/PACF diagnostics confirmed the series became more stationary and suitable for ARIMA modeling.

4. Model Fit Comparison: ARIMA and ARIMAX

ARIMA(1,1,1)(1,0,2)[7]:

  • This model captures trend and weekly cycles.

  • AIC = 12,041.75; residual variance (σ²) = 844,004

  • In-sample fit shows reasonable tracking but fails to adapt to late-2012 rental drops.

ARIMAX (with temperature, humidity, windspeed):

  • Regression with ARIMA(2,1,1) errors and weather covariates.

  • AIC = 11,791.54; residual variance = 598,871

  • Weather variables significantly improve explanatory power:

    • temp_c: positive influence

    • humidity and wind speed: strong negative effects

ARIMAX substantially outperforms ARIMA, indicating that weather is a critical driver of daily bike rental counts.

5. Forecasting

  • A 30-day forecast was generated using the ARIMA model.

  • The forecast captured the ongoing upward trend and preserved weekly seasonality in projected values.

  • The forecast interval provides useful insights into expected rental demand and uncertainty bounds.

6. Conclusion

  • The time series of daily bike rentals is non-stationary and shows both trend and weekly seasonality.

  • ARIMAX clearly outperforms ARIMA for short-term forecasting due to the inclusion of weather effects.

  • This model could be used to support short-term operational planning (e.g. staffing, bike redistribution) within the bikeshare system.

7. Recommendations for Future Work

  • Extend ARIMAX with lagged and interaction terms
  • Compare more models e.g. Prophet and XGBoost
  • Segment by user type (casual vs registered)
  • Longer term seasonality: consider monthly/yearly cycles with weekly aggregation.
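As a starting point for the first recommendation, lagged and interaction covariates can be precomputed and supplied in the model formula. A sketch (untested, reusing the ARIMAX setup above, with `temp_c_lag1` a hypothetical lagged-temperature column):

```r
# dplyr::lag() shifts the covariate back one day: yesterday's temperature
data_lagged <- data %>%
  mutate(temp_c_lag1 = lag(temp_c, 1)) %>%
  drop_na(temp_c_lag1)

arimax_lagged <- arima_reg() %>%
  set_engine("auto_arima") %>%
  fit(cnt ~ dteday + temp_c + temp_c_lag1 + hum + windspeed + temp_c:hum,
      data = data_lagged)   # temp_c:hum adds a temperature-humidity interaction
```

Precomputing the lag with mutate() is safer than calling lag() inside the formula, where it may not resolve to the dplyr version.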